How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity

نویسندگان

  • Wilmer Leal
  • Eugenio-José Llanos
  • Guillermo Restrepo
  • Carlos F. Suárez
  • Manuel-Elkin Patarroyo
چکیده

BACKGROUND Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two equidistant clusters from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select that less similar ones in QSAR studies. RESULTS Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity to show how our functions may be used for studying the effect of ties in HCA analysis. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. CONCLUSIONS Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply, that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately.Graphical abstractFour cluster contrast functions identifying statistically sound clusters within dendrograms considering ties in proximity.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Graph Clustering by Hierarchical Singular Value Decomposition with Selectable Range for Number of Clusters Members

Graphs have so many applications in real world problems. When we deal with huge volume of data, analyzing data is difficult or sometimes impossible. In big data problems, clustering data is a useful tool for data analysis. Singular value decomposition(SVD) is one of the best algorithms for clustering graph but we do not have any choice to select the number of clusters and the number of members ...

متن کامل

Studying of Research Related to COVID-19 Vaccine in Iran and the World: A Thematic Analysis and Scientific Collaborations

Background and Objective: The purpose of the present study is thematic analysis and scientific collaborations of research related to Covid 19 vaccine in Iran and the world based on scientific products indexed in Web of Science (WOS). Materials and Methods: The present study is a descriptive-analytical study with a scientometric approach and using the methods of content analysis and techniques ...

متن کامل

A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)

Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using some metaheuristics algorithms. In this study, we propose simulated annealing based algorithm for K-means in the clustering analysis which we refer it a...

متن کامل

Assessment of Clustering Methods for Predicting Permeability in a Heterogeneous Carbonate Reservoir

Permeability, the ability of rocks to flow hydrocarbons, is directly determined from core. Due to high cost associated with coring, many techniques have been suggested to predict permeability from the easy-to-obtain and frequent properties of reservoirs such as log derived porosity. This study was carried out to put clustering methods (dynamic clustering (DC), ascending hierarchical clustering ...

متن کامل

تحلیل موضوعی مقالات مرتبط با اعتیاد در پایگاه مدلاین به روش خوشه بندی سلسله مراتبی: 2014-1991

Introduction: Addiction, which has recently attracted the attention of researchers, is a serious problem worldwide. The growth of relevant literature contributes to a better understanding of this problem and improves the interaction between executive organizations and academic institutions. It is important to identify the active subject areas within this field and to explore the topics which ar...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره 8  شماره 

صفحات  -

تاریخ انتشار 2016